Dyr og Data

Introduction

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2024-08-28

Welcome

Data science skills

We are all data scientists

We all should be data scientists

Dyr og data

Animals and data

Dyr og data is your first foray into learning key data science skills

Why?

  1. to help you complete your Animal Science degree
    • efficiently
    • reproducibly
  2. to prepare you for your future career
    • in industry
    • in another field
    • in academia
  3. to equip you as informed citizens

Dyr og Data

This is a course in applied data science

We’re not trying to turn you into computer scientists or statisticians

We do want you to use data science skills during your education & when you graduate

Course topics

  • Introduction

  • What is Data Science?

  • Data types, storage, security and ethics

  • Data handling and wrangling

  • Data Visualization

  • Descriptive and exploratory data analysis

  • Statistical thinking and ‘data literacy’

  • Dynamic reporting in document and presentation format

  • Databases

Learning objectives

At the end of the course

Knowledge
  • Describe and separate different data types and methods of data storage
  • Describe the visualization theory and grammar behind graphics, and apply both in the creation of data visualizations
  • Define and explain fundamental statistical concepts, and apply statistical thinking to make evidence-based decisions from data
Skills
  • Select and use methods for data handling of different types of data
  • Analyze data using descriptive statistics and exploratory data analysis and explain the results
  • Create dynamic reports of data, both as a document and as a presentation

Flipped classroom

This course is different

You will mostly be working in groups during class time

Outside of class you will:

  • watch short lecture videos
  • read parts of the course texts
  • (later) work on your portfolio

Course texts

No free version 😭

We will use most of this book by the time you graduate

Physical copies in the bookstore

We won’t use it for a few weeks yet

R for Data Science (r4ds)

Free, online version — no need to buy

https://r4ds.hadley.nz/

Assessments

The course is assessed as pass or fail

Oral Examination

Each student gives a 5-minute presentation on a randomly selected portfolio project (48 hours of preparation time)

Followed by 15 minutes of questions on the course syllabus

5 Portfolio projects

We’ll introduce these to you later in the course

Computing Environment

posit.cloud

Sign up for a free account — invites sent out this morning

Contacting us

By direct email:

  • gavin@anivet.au.dk (include Dyr og data in subject line)
  • mona@anivet.au.dk (include Dyr og data in subject line)

Expect a response within 48 hours (2 working days)

During the week, responses usually arrive within 24 hours

If you send an email after 4pm on Friday, don’t expect a response until Monday at the earliest

Email to arrange a meeting as needed

Groups

Randomly assigned

Mona and I have randomly assigned you to groups

From Friday please sit with your group

Groups

Group 1

  • Léa Tinch
  • Nanna Søgaard Jensen
  • Olivia Berg Jacobsen
  • Rebecca Graveson

Group 2

  • Christine Lykke Jessen
  • Line Hajslund Aarup
  • Mikkeline Høi Gottschalk
  • Mille Liv Søkær Laursen

Group 3

  • Ellen Dam Kristiansen
  • Julie Thulstrup Bruhn
  • Simone Holst Petersen
  • Julie Liv Bredesen

Group 4

  • Cecillie Højlund
  • Frederik Berg Olsson
  • Laura Elisabeth Westergaard Hansen
  • Marie Skov
  • Natashja Dahl Jakobsen

Group 5

  • Camilla Lyck Crawack
  • Emma Udkilt Jørgensen
  • Helle Skovgaard Andersen
  • Kirstine Krogager-Nielsen
  • Thomas Fly Christensen

What to bring to class

  1. Laptop!

  2. Textbook

Breaks

Tell us something about yourself

Data science

Learning objectives

At the end of this topic you should be able to

  • Articulate what data science is

  • Understand at a high level the steps involved in doing data science

  • Describe the roles and skills of a data scientist

What is data science?

Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large data sets

Kelleher & Tierney, p. 1

Related fields

  • Machine learning
  • Data mining

Data science is broader, borrowing from these fields and many others

What is data science?

[Figure: data science Venn diagram with the labels Domain Expertise, Computing, Statistics, and Data Science]

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Actionable insight

Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large data sets

Kelleher & Tierney, p. 1

Data science outputs are only useful if we or others can make use of them

Insight

Does data science provide us with information that wasn’t obvious?

Actionable

Can we do something useful with the new information?

Example data science problems

Customer segmentation ⟶ clustering
  • find groups of individuals exhibiting similar behavior
Association rule mining
  • find groups of things that co-occur
  • e.g. animals with similar sets of symptoms
Anomaly or outlier detection
  • identify strange or abnormal events; e.g. fraudulent billing, disease, behavior
Classification ⟶ prediction
  • develop models to predict some outcome, i.e. a missing piece of data
  • predict disease from risk factors & test results
  • predict disease from a CT scan
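
As a rough illustration of the anomaly detection problem above, here is a minimal sketch in Python (the course itself works in R). It flags values lying far from the mean of a sample. The milk-yield numbers, the function name `find_outliers`, and the threshold of 2 standard deviations are all assumptions made up for this example, not part of the course material; real anomaly detection usually needs more robust methods.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    centre = mean(values)
    spread = stdev(values)
    return [v for v in values if abs(v - centre) / spread > threshold]


# Hypothetical daily milk yields (kg) for one cow; invented for illustration only
daily_yield_kg = [28.1, 27.5, 29.0, 28.4, 12.3, 28.8, 27.9, 28.6]

outliers = find_outliers(daily_yield_kg)
print(outliers)  # the unusually low day is the only value flagged
```

A flagged value is only a prompt to look closer: it could be a sick animal, a faulty sensor, or a typing error.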

Four “A”s of Data Science

  1. Data Architecture

  2. Data Acquisition

  3. Data Analysis

  4. Data Archiving

Data Architecture

Provide input on how data need to be routed and organized to support the

  • analysis,

  • visualization, and

  • presentation of data

Data Acquisition

How should the data be collected and represented prior to analysis?

Important tasks that need to happen before data can be profitably analyzed are

  • representing data

  • transforming data

  • grouping

  • linking
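
The linking and grouping tasks above can be sketched in a few lines of Python (used here only for illustration; the course works in R). Two small tables are linked through a shared identifier and then grouped to compute a summary. All records, column names, and herd names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records; in practice these would come from files or a database
animals = [
    {"animal_id": "A1", "herd": "North"},
    {"animal_id": "A2", "herd": "North"},
    {"animal_id": "A3", "herd": "South"},
]
weights = [
    {"animal_id": "A1", "weight_kg": 612},
    {"animal_id": "A2", "weight_kg": 588},
    {"animal_id": "A3", "weight_kg": 640},
]

# Linking: attach each animal's herd to its weight record via the shared key
herd_of = {row["animal_id"]: row["herd"] for row in animals}
linked = [{**w, "herd": herd_of[w["animal_id"]]} for w in weights]

# Grouping: collect the weights recorded in each herd
by_herd = defaultdict(list)
for row in linked:
    by_herd[row["herd"]].append(row["weight_kg"])

mean_weight = {herd: sum(ws) / len(ws) for herd, ws in by_herd.items()}
print(mean_weight)
```

The shared key (`animal_id` here) is what makes linking possible, which is one reason consistent identifiers matter when data are first collected.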

Data Analysis

How can we summarize data?

Use samples of data to make inferences about the larger context or population

Visualize data and analysis outputs in graphs, tables, animations, dashboards

Communicate the results of the analysis
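
To make the idea of summarizing a sample and inferring something about a population concrete, here is a minimal Python sketch (the course itself uses R). It computes the mean, standard deviation, and an approximate 95% confidence interval for the population mean. The weights are invented, the function name `summarize` is hypothetical, and the interval uses a normal approximation, which is only a rough guide for a sample this small.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize(sample, level=0.95):
    """Summarize a sample and give an approximate confidence interval for the mean."""
    n = len(sample)
    m = mean(sample)
    s = stdev(sample)                          # sample standard deviation
    se = s / sqrt(n)                           # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation (assumption)
    return {"n": n, "mean": m, "sd": s, "ci": (m - z * se, m + z * se)}


# Hypothetical body weights (kg) from a sample of five animals
sample_weights = [600.0, 610.0, 590.0, 620.0, 580.0]

summary = summarize(sample_weights)
print(summary)
```

The interval expresses uncertainty: a different sample of five animals would give a somewhat different mean.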

Data Archiving

How should we preserve data that has been collected?

What forms of the data need to be preserved?

Difficult to anticipate future uses of data

Skills

Data science skills

[Figure: skills of a data scientist: Communication; Domain Expertise; Data Ethics & Regulation; Data Wrangling & Databases; Computer Science & HPC; Data Visualization; Statistics & Probability; Machine Learning]

Domain expertise

Important to learn the application domain

Need to know enough to

  • understand the problem

  • understand why the problem is important

  • understand how data science might address the problem

Ethics and regulation

If data are important enough to collect, they’re important enough to affect people’s lives

Need to understand ethical issues

  • privacy, personal data-protection

  • biases in data & models

  • limitations of the data

  • prevent misuse

Data wrangling & databases

Working with data, files, & databases is an essential skill

  • understand how data are stored

  • transform data

  • generate metadata

  • link data

  • query databases with SQL
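
As a small taste of querying a database with SQL, the sketch below uses Python's built-in `sqlite3` module and a temporary in-memory database (Python is used only for illustration; the course works in R). The table name `weighings`, its columns, and the data are all hypothetical.

```python
import sqlite3

# An in-memory database that disappears when the connection is closed
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weighings (animal_id TEXT, herd TEXT, weight_kg REAL)")

# Hypothetical records, invented for illustration only
conn.executemany(
    "INSERT INTO weighings VALUES (?, ?, ?)",
    [("A1", "North", 612), ("A2", "North", 588), ("A3", "South", 640)],
)

# SQL query: mean weight per herd
rows = conn.execute(
    "SELECT herd, AVG(weight_kg) AS mean_weight "
    "FROM weighings GROUP BY herd ORDER BY herd"
).fetchall()
conn.close()

print(rows)
```

The query describes what result is wanted (a mean per herd) and leaves it to the database to work out how to compute it.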

Computer science & HPC

Computer science & high-performance computing (HPC) provide algorithms & data structures to tackle increasingly large amounts of data

  • algorithms

  • distributed computing & MapReduce

  • use computer clusters to parallelize operations

Data Visualization

Know how to present data in forms that are suitable and that aid decision making

  • theory behind perception

  • encoding data graphically

  • appropriate plots

  • grammar of graphics

  • create infographics

  • dashboards

Statistics & probability

Statistics is the field of science concerned with making inferences from samples of data drawn from larger populations

  • exploratory data analysis

  • summarize data

  • use statistical methods to make inferences

  • communicate results of statistical models

Machine learning

An offshoot from statistics (statistical learning) & computer science

  • underlying principles of machine learning methods

  • model assessment

  • variable importance

  • neural networks

  • tree-based models

  • prediction vs explanation

Communication

Communicating with end users, data generators, etc. is an essential component of any applied science

Need to translate the technical language of animal science, computer science, statistics, and machine learning into the language used in specific domains

  • communicate with specialists

  • communicate with end users

  • aid decision making

  • communicate uncertainty

Before Friday’s Class

  • Read from r4ds

    • Introduction
  • Watch a short video about Posit.cloud

  • Watch a short video about running R code